Part I - Data Exploration: Ford GoBike Trip data¶

by Ketty Muwowo¶

Table of Contents¶

  • Introduction
  • Univariate Exploration
  • Bivariate Exploration
  • Multivariate Exploration
  • Conclusions

Introduction¶

The Ford GoBike trip dataset is the one we have chosen for exploration in this project. It includes individual information about the rides made in a bike-sharing system covering the greater San Francisco Bay Area. The dataset contains trip data for February 2019. Our primary goal is to systematically explore this dataset, starting from plots of single variables and building up to plots of multiple variables. We will then produce a short presentation that illustrates some of the properties, trends, and relationships discovered in the dataset.

The following are the questions of consideration in our exploration.

Questions¶

Univariate Exploration¶

  1. What is the distribution for members' ages?

  2. What is the distribution for member_gender and user_type features?

  3. What is the distribution for the trip duration in minutes?

  4. During what period of the day are more trips likely to be booked?

  5. What are the top 5 popular start stations for the trips taken?

Bivariate Exploration¶

  1. For each gender, how long, in minutes, does the trip last?

  2. Is there any correlation between Member age and duration of trip?

  3. What is the relationship between age and user_type?

  4. What is the relationship between the 3 categorical variables period_ofday, user_type and member_gender?

Multivariate Exploration¶

  1. For each period of the day, what is the average trip duration in minutes for each user type?

  2. What is the relationship between member_gender, age and duration_min?

  3. How closely correlated are the different variables in the dataset?

  4. What is the relationship between member_gender, age and user_type?

Preliminary Wrangling¶

In [1]:
# importing all the necessary packages and set plots to be embedded inline.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import calendar
import plotly.express as px
import time
%matplotlib inline

Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.

In [2]:
# Loading the Ford Gobike dataset 
bike_df=pd.read_csv('201902-fordgobike-tripdata.csv')

# displaying the first 5 rows of the Ford GoBike dataset.
bike_df.head(5)
Out[2]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip
0 52185 2019-02-28 17:32:10.1450 2019-03-01 08:01:55.9750 21.0 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 13.0 Commercial St at Montgomery St 37.794231 -122.402923 4902 Customer 1984.0 Male No
1 42521 2019-02-28 18:53:21.7890 2019-03-01 06:42:03.0560 23.0 The Embarcadero at Steuart St 37.791464 -122.391034 81.0 Berry St at 4th St 37.775880 -122.393170 2535 Customer NaN NaN No
2 61854 2019-02-28 12:13:13.2180 2019-03-01 05:24:08.1460 86.0 Market St at Dolores St 37.769305 -122.426826 3.0 Powell St BART Station (Market St at 4th St) 37.786375 -122.404904 5905 Customer 1972.0 Male No
3 36490 2019-02-28 17:54:26.0100 2019-03-01 04:02:36.8420 375.0 Grove St at Masonic Ave 37.774836 -122.446546 70.0 Central Ave at Fell St 37.773311 -122.444293 6638 Subscriber 1989.0 Other No
4 1585 2019-02-28 23:54:18.5490 2019-03-01 00:20:44.0740 7.0 Frank H Ogawa Plaza 37.804562 -122.271738 222.0 10th Ave at E 15th St 37.792714 -122.248780 4898 Subscriber 1974.0 Male Yes
In [3]:
# displaying the shape of the dataset
print(bike_df.shape)
(183412, 16)

The dataset contains $183,412$ rows and $16$ columns.

In [4]:
# checking the datatypes of the dataset
print(bike_df.dtypes)
duration_sec                 int64
start_time                  object
end_time                    object
start_station_id           float64
start_station_name          object
start_station_latitude     float64
start_station_longitude    float64
end_station_id             float64
end_station_name            object
end_station_latitude       float64
end_station_longitude      float64
bike_id                      int64
user_type                   object
member_birth_year          float64
member_gender               object
bike_share_for_all_trip     object
dtype: object

From the datatypes displayed above, we see that our dataset contains 3 types of datatypes namely:

  • int64

  • object

  • float64

We observe that some of the data types are inappropriate and have to be changed. The following changes will be made:

  • start_time and end_time to be changed from object to datetime.
  • start_station_id and end_station_id are recorded as float64 and have to be changed to the object datatype.
  • bike_id to be changed to object.
  • member_birth_year is recorded as float64, to be changed to int64.
  • user_type and member_gender to be changed to the category datatype.
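A minimal sketch of these conversions on a toy frame (the values below are made up for illustration). It assumes we want to keep missing birth years as missing, which pandas' nullable `Int64` dtype allows; a plain `int64` cast would fail while NaNs are still present.

```python
import pandas as pd

# Toy rows mimicking a few columns of the trip data (illustrative values only).
toy = pd.DataFrame({
    "start_time": ["2019-02-28 17:32:10.1450", "2019-02-28 18:53:21.7890"],
    "end_time": ["2019-03-01 08:01:55.9750", "2019-03-01 06:42:03.0560"],
    "start_station_id": [21.0, 23.0],
    "member_birth_year": [1984.0, None],
    "user_type": ["Customer", "Customer"],
})

# Parse both timestamp columns.
for col in ["start_time", "end_time"]:
    toy[col] = pd.to_datetime(toy[col])

# Identifiers are labels, not quantities: drop the ".0" and store them as strings.
toy["start_station_id"] = toy["start_station_id"].astype("Int64").astype("string")

# The nullable integer dtype keeps the missing birth year as <NA>.
toy["member_birth_year"] = toy["member_birth_year"].astype("Int64")
toy["user_type"] = toy["user_type"].astype("category")

print(toy.dtypes)
```

Whether to keep missing values or fill them is a design choice; the sketch only shows that the type conversion itself does not force a fill.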
In [5]:
# checking some general information about the dataset.
bike_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration_sec             183412 non-null  int64  
 1   start_time               183412 non-null  object 
 2   end_time                 183412 non-null  object 
 3   start_station_id         183215 non-null  float64
 4   start_station_name       183215 non-null  object 
 5   start_station_latitude   183412 non-null  float64
 6   start_station_longitude  183412 non-null  float64
 7   end_station_id           183215 non-null  float64
 8   end_station_name         183215 non-null  object 
 9   end_station_latitude     183412 non-null  float64
 10  end_station_longitude    183412 non-null  float64
 11  bike_id                  183412 non-null  int64  
 12  user_type                183412 non-null  object 
 13  member_birth_year        175147 non-null  float64
 14  member_gender            175147 non-null  object 
 15  bike_share_for_all_trip  183412 non-null  object 
dtypes: float64(7), int64(2), object(7)
memory usage: 22.4+ MB

From the output above, we note that the data contains missing values.

In [6]:
# checking the features/columns with missing values.
bike_df.isna().sum()
Out[6]:
duration_sec                  0
start_time                    0
end_time                      0
start_station_id            197
start_station_name          197
start_station_latitude        0
start_station_longitude       0
end_station_id              197
end_station_name            197
end_station_latitude          0
end_station_longitude         0
bike_id                       0
user_type                     0
member_birth_year          8265
member_gender              8265
bike_share_for_all_trip       0
dtype: int64

We observe that the start_station_id, start_station_name, end_station_id, end_station_name, member_birth_year and member_gender variables contain missing values.

To fix this, we will fill the missing values of the numeric variables with the most frequent value (the mode) of each column.

In [7]:
# imputing missing values of the numeric variables with each column's mode
missval_col=['start_station_id','end_station_id','member_birth_year']

for i in missval_col:
    bike_df[i]=bike_df[i].fillna(bike_df[i].mode()[0])
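
Note that the loop above fills gaps with each column's most frequent value (its mode), which for member_birth_year assigns a single birth year to every rider whose age is unknown. A small sketch on made-up data, showing how this inflates the count of one value; the toy series is an assumption and is not taken from the dataset.

```python
import pandas as pd

# Made-up birth years with two missing entries (illustrative only).
birth_year = pd.Series([1988.0, 1988.0, 1990.0, None, None, 1975.0])

mode_value = birth_year.mode()[0]        # most frequent observed value
filled = birth_year.fillna(mode_value)   # same approach as the loop above

print(mode_value)                        # 1988.0
print((birth_year == mode_value).sum())  # 2 observed riders born in 1988
print((filled == mode_value).sum())      # 4 after imputation
```

This is worth keeping in mind when reading the age distribution later on, since the imputed rows all share one age.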

 

For non-numerical variables like start_station_name and end_station_name, we will fill the missing values with the string "None".

In [8]:
# imputing missing values with "None" for the object-type station name variables.
bike_df['start_station_name'] = bike_df['start_station_name'].fillna("None")
bike_df['end_station_name'] = bike_df['end_station_name'].fillna("None")
In [9]:
# For the member_gender column, we will fill the null values with "No Gender".

bike_df['member_gender'] = bike_df['member_gender'].fillna("No Gender")
In [10]:
# Checking that the missing values have been imputed.
bike_df.isna().sum()
Out[10]:
duration_sec               0
start_time                 0
end_time                   0
start_station_id           0
start_station_name         0
start_station_latitude     0
start_station_longitude    0
end_station_id             0
end_station_name           0
end_station_latitude       0
end_station_longitude      0
bike_id                    0
user_type                  0
member_birth_year          0
member_gender              0
bike_share_for_all_trip    0
dtype: int64
In [11]:
# Checking the descriptive statistics of the data.
bike_df.describe()
Out[11]:
duration_sec start_station_id start_station_latitude start_station_longitude end_station_id end_station_latitude end_station_longitude bike_id member_birth_year
count 183412.000000 183412.000000 183412.000000 183412.000000 183412.000000 183412.000000 183412.000000 183412.000000 183412.000000
mean 726.078435 138.503866 37.771223 -122.352664 136.174743 37.771427 -122.352250 4472.906375 1984.950347
std 1794.389780 111.750001 0.099581 0.117097 111.478306 0.099490 0.116673 1664.383394 9.908290
min 61.000000 3.000000 37.317298 -122.453704 3.000000 37.317298 -122.453704 11.000000 1878.000000
25% 325.000000 47.000000 37.770083 -122.412408 44.000000 37.770407 -122.411726 3777.000000 1981.000000
50% 514.000000 104.000000 37.780760 -122.398285 100.000000 37.781010 -122.398279 4958.000000 1988.000000
75% 796.000000 239.000000 37.797280 -122.286533 235.000000 37.797320 -122.288045 5502.000000 1992.000000
max 85444.000000 398.000000 37.880222 -121.874119 398.000000 37.880222 -121.874119 6645.000000 2001.000000
In [12]:
# changing the datatypes of certain selected features of interest to appropriate data types.

bike_df[['start_station_id','end_station_id','bike_id']]= bike_df[['start_station_id','end_station_id','bike_id']].astype('object')
bike_df[['user_type','member_gender']]= bike_df[['user_type','member_gender']].astype('category')
bike_df['member_birth_year']= bike_df['member_birth_year'].astype('int64')
In [13]:
# checking the unique values for member gender.
bike_df['member_gender'].unique()
Out[13]:
['Male', 'No Gender', 'Other', 'Female']
Categories (4, object): ['Female', 'Male', 'No Gender', 'Other']
In [14]:
# Derive the start hour and label each trip as morning (00-11), afternoon (12-17) or night (18-23)
bike_df['start_time']=pd.to_datetime(bike_df['start_time'])
bike_df['start_hour']=bike_df['start_time'].dt.hour
bike_df['period_ofday']='morning'
# .loc assigns on the DataFrame itself and avoids the SettingWithCopyWarning raised by chained indexing
bike_df.loc[bike_df['start_hour'].between(12, 17), 'period_ofday'] = 'afternoon'
bike_df.loc[bike_df['start_hour'].between(18, 23), 'period_ofday'] = 'night'
In [15]:
#Testing the start hour and time of day columns created.
print(bike_df['start_hour'].value_counts())
print(bike_df['period_ofday'].value_counts())
17    21864
8     21056
18    16827
9     15903
16    14169
7     10614
19     9881
15     9174
12     8724
13     8551
10     8364
14     8152
11     7884
20     6482
21     4561
6      3485
22     2916
23     1646
0       925
5       896
1       548
2       381
4       235
3       174
Name: start_hour, dtype: int64
afternoon    70634
morning      70465
night        42313
Name: period_ofday, dtype: int64
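
The same three-way split can also be expressed with `pd.cut`, which replaces the separate mask assignments with a single call and returns an ordered categorical. This is a sketch on a few example hours; the bin edges mirror the rule used above (hours 0 to 11 are morning, 12 to 17 afternoon, 18 to 23 night).

```python
import pandas as pd

# A handful of example start hours (illustrative only).
hours = pd.Series([0, 8, 11, 12, 17, 18, 23], name="start_hour")

period = pd.cut(
    hours,
    bins=[0, 12, 18, 24],   # left-closed intervals: [0, 12), [12, 18), [18, 24)
    right=False,
    labels=["morning", "afternoon", "night"],
)

print(period.tolist())
print(period.cat.ordered)   # labels keep the order morning < afternoon < night
```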
In [16]:
#convert time period of day into ordered categorical data type.
ordinal_dict = {'period_ofday': ['morning', 'afternoon', 'night']}

for item in ordinal_dict:
    ordered_var = pd.api.types.CategoricalDtype(ordered = True, categories = ordinal_dict[item])
    bike_df[item] = bike_df[item].astype(ordered_var)
In [17]:
# creating the age variable/feature
bike_df['age']=bike_df['member_birth_year'].apply(lambda birth_year : 2019 - birth_year )
In [18]:
# Testing if age variable  has been added to the dataset.
print(bike_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 19 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration_sec             183412 non-null  int64         
 1   start_time               183412 non-null  datetime64[ns]
 2   end_time                 183412 non-null  object        
 3   start_station_id         183412 non-null  object        
 4   start_station_name       183412 non-null  object        
 5   start_station_latitude   183412 non-null  float64       
 6   start_station_longitude  183412 non-null  float64       
 7   end_station_id           183412 non-null  object        
 8   end_station_name         183412 non-null  object        
 9   end_station_latitude     183412 non-null  float64       
 10  end_station_longitude    183412 non-null  float64       
 11  bike_id                  183412 non-null  object        
 12  user_type                183412 non-null  category      
 13  member_birth_year        183412 non-null  int64         
 14  member_gender            183412 non-null  category      
 15  bike_share_for_all_trip  183412 non-null  object        
 16  start_hour               183412 non-null  int64         
 17  period_ofday             183412 non-null  category      
 18  age                      183412 non-null  int64         
dtypes: category(3), datetime64[ns](1), float64(4), int64(4), object(7)
memory usage: 22.9+ MB
None

What is the structure of your dataset?¶

The Structure of the dataset¶

The dataset includes $183\,412$ trips and $16$ features namely:

  • duration_sec

  • start_time

  • end_time

  • start_station_id

  • start_station_name

  • start_station_latitude

  • start_station_longitude

  • end_station_id

  • end_station_name

  • end_station_latitude

  • end_station_longitude

  • bike_id

  • user_type

  • member_birth_year

  • member_gender

  • bike_share_for_all_trip

The dataset contains 3 types of datatypes namely:

  • int64

  • object

  • float64

The start_time feature was recorded with the object datatype and has been converted to datetime; end_time is still stored as object (see the info() output above). Also, as we are interested in finding out when most trips are taken, the start time has been broken down into periods of the day: morning, afternoon and night. With the member birth year provided, we compute the ages of the members so as to investigate the relationship between age and trip duration as well as the type of bike user.

What is/are the main feature(s) of interest in your dataset?¶

The following are the main features of interest.

  • duration_sec : The duration of the trip in seconds.
  • member_gender : The gender of the members: Male, Female or Other.

  • member_birth_year : The year of birth of each user.

  • start_time : The start time for each trip.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?¶

The following are the features of interest to help support our investigation.

  • user_type : The type of bike user whether Customer or Subscriber.
  • age : The age of the bike users which will be derived from the member_birth_year variable.
  • period_ofday : The period of the day (morning, afternoon or night), derived from the start_time variable.

  • duration_min : The trip duration in minutes.

Univariate Exploration¶

In this section, we investigate the distributions of individual variables and observe if there are any unusual points or outliers as well as any relationships between variables.

What is the distribution for members' ages?¶

In [19]:
# plotting the distribution for the age of bike users.
plt.figure(figsize=[10, 8],dpi=100)
plt.hist(data = bike_df, x = 'age')
plt.xlabel('Member age')
plt.title('Distribution of Age for the Members')
plt.show()
In [20]:
# plotting the box plot to check the outliers clearly.
plt.figure(figsize=(8,6))
plt.boxplot(bike_df['age']);
# a vertical box plot shows age on the y-axis, so only the y-axis needs a label
plt.ylabel('Member Age (Years)')
plt.title('Distribution of Members Age');

We observe that the distribution of members' ages is right-skewed, with the majority of users in the age range $20-40$. There are also outliers in the age variable, as it is not plausible for a user to be above $100$ years old.

In [21]:
# displaying the outliers in the age variable.
bike_df.query('age >= 100')
Out[21]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip start_hour period_ofday age
1285 148 2019-02-28 19:29:17.627 2019-02-28 19:31:45.9670 158.0 Shattuck Ave at Telegraph Ave 37.833279 -122.263490 173.0 Shattuck Ave at 55th St 37.840364 -122.264488 5391 Subscriber 1900 Male Yes 19 night 119
10827 1315 2019-02-27 19:21:34.436 2019-02-27 19:43:30.0080 343.0 Bryant St at 2nd St 37.783172 -122.393572 375.0 Grove St at Masonic Ave 37.774836 -122.446546 6249 Subscriber 1900 Male No 19 night 119
16087 1131 2019-02-27 08:37:36.864 2019-02-27 08:56:28.0220 375.0 Grove St at Masonic Ave 37.774836 -122.446546 36.0 Folsom St at 3rd St 37.783830 -122.398870 4968 Subscriber 1900 Male No 8 morning 119
19375 641 2019-02-26 17:03:19.855 2019-02-26 17:14:01.6190 9.0 Broadway at Battery St 37.798572 -122.400869 30.0 San Francisco Caltrain (Townsend St at 4th St) 37.776598 -122.395282 6164 Customer 1900 Male No 17 afternoon 119
21424 1424 2019-02-26 08:58:02.904 2019-02-26 09:21:47.7490 375.0 Grove St at Masonic Ave 37.774836 -122.446546 343.0 Bryant St at 2nd St 37.783172 -122.393572 5344 Subscriber 1900 Male No 8 morning 119
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
171996 1368 2019-02-03 17:33:54.607 2019-02-03 17:56:42.9490 37.0 2nd St at Folsom St 37.785000 -122.395936 375.0 Grove St at Masonic Ave 37.774836 -122.446546 4988 Subscriber 1900 Male No 17 afternoon 119
173711 993 2019-02-03 09:45:30.464 2019-02-03 10:02:04.1690 375.0 Grove St at Masonic Ave 37.774836 -122.446546 36.0 Folsom St at 3rd St 37.783830 -122.398870 5445 Subscriber 1900 Male No 9 morning 119
177708 1527 2019-02-01 19:09:28.387 2019-02-01 19:34:55.9630 343.0 Bryant St at 2nd St 37.783172 -122.393572 375.0 Grove St at Masonic Ave 37.774836 -122.446546 5286 Subscriber 1900 Male No 19 night 119
177885 517 2019-02-01 18:38:40.471 2019-02-01 18:47:18.3920 25.0 Howard St at 2nd St 37.787522 -122.397405 30.0 San Francisco Caltrain (Townsend St at 4th St) 37.776598 -122.395282 2175 Subscriber 1902 Female No 18 night 117
182830 428 2019-02-01 07:45:05.934 2019-02-01 07:52:14.9220 284.0 Yerba Buena Center for the Arts (Howard St at ... 37.784872 -122.400876 67.0 San Francisco Caltrain Station 2 (Townsend St... 37.776639 -122.395526 5031 Subscriber 1901 Male No 7 morning 118

72 rows × 19 columns

In [22]:
# dropping the outliers in the age variable.
bike_df.drop(bike_df.query('age >= 100').index, inplace= True)
In [23]:
# Checking if the outliers have been removed.
bike_df['age'].describe()
Out[23]:
count    183340.000000
mean         34.016379
std           9.766728
min          18.000000
25%          27.000000
50%          31.000000
75%          38.000000
max          99.000000
Name: age, dtype: float64
In [24]:
# Using the log scale for the histogram of Member age.

We now change the x-axis to log type, and change the axis limit

In [25]:
# checking the descriptive statistics for age on a log scale.
np.log10(bike_df['age'].describe())
Out[25]:
count    5.263257
mean     1.531688
std      0.989749
min      1.255273
25%      1.431364
50%      1.491362
75%      1.579784
max      1.995635
Name: age, dtype: float64
In [26]:
# Axis transformation for age distribution.
bins= 10 ** np.arange(1.26,2+0.025,0.025)
plt.figure(figsize=(10,8))
plt.hist(data=bike_df,x='age',bins=bins)
plt.xscale('log')
plt.title('Age Distribution on a log Scale');

Upon removing the outliers and plotting the distribution for age on a log scale, we observe that the distribution is now a multimodal distribution as it has $3$ peaks.

In [27]:
# plotting the box plot to check if the outliers have been removed.
plt.figure(figsize=(8,6))
plt.boxplot(bike_df['age']);
# a vertical box plot shows age on the y-axis, so only the y-axis needs a label
plt.ylabel('Member Age (Years)')
plt.title('Distribution of Members Age');

From the box plot, we see that the implausible ages have been removed. We only have one outlier which is above $90$ years.
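
For reference, matplotlib's `boxplot` by default draws as individual points any value lying more than 1.5 times the interquartile range beyond the quartiles. A quick sketch using the quartiles from the `describe()` output above (27 and 38) shows where those limits fall:

```python
# Quartiles of age taken from the describe() output above.
q1, q3 = 27.0, 38.0
iqr = q3 - q1

# Default matplotlib rule (whis=1.5): points outside these fences are drawn individually.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(lower_fence, upper_fence)   # 10.5 54.5
```

Under this rule, any ages above 54.5 would still be drawn as points even though they are plausible values, so a point beyond the whisker is not necessarily a data error.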

What is the distribution for member_gender and user_type features?¶

In [28]:
# Plotting the 2 categorical variables user_type and member_gender to get an idea of the distribution of each.
fig, ax = plt.subplots(nrows=2, figsize=[10,10])
plot_color=sb.color_palette()[0]
sb.countplot(data=bike_df, x='user_type',color=plot_color,ax=ax[0])
sb.countplot(data=bike_df, x='member_gender',color=plot_color,ax=ax[1])
plt.show()

According to the plots above, the majority of users are subscribers, while the rest are customers. Male users are also in the majority compared to female users. The data alone do not tell us why fewer trips are recorded for women.

What is the distribution for the trip duration in minutes?¶

In [29]:
# we first begin by creating a new column called duration_min.
bike_df['duration_min']=bike_df['duration_sec']/60
In [30]:
# checking the summary statistics for trip duration in minutes.
bike_df['duration_min'].describe()
Out[30]:
count    183340.000000
mean         12.101764
std          29.911955
min           1.016667
25%           5.416667
50%           8.566667
75%          13.266667
max        1424.066667
Name: duration_min, dtype: float64
In [31]:
# plotting the histogram for trip duration in minutes.
bin_size=300
bins=np.arange(0,bike_df['duration_min'].max()+bin_size, bin_size)
plt.figure(figsize=[10,8],dpi=100)
plt.hist(data=bike_df, x='duration_min',bins=bins)
plt.xlabel('duration (min)')
plt.title('Trip duration in Minutes', fontsize=15)
plt.show()

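With a bin width of $300$ minutes, almost all trips fall into the first bin: the mean duration is about $12$ minutes and three quarters of trips last under about $13$ minutes, while the longest trip is about $1424$ minutes. To make the shape of the distribution visible, we will add x-axis limits to the plot.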

In [32]:
# plotting the histogram with an addition of the xlim.
plt.figure(figsize=[10,8],dpi=200)
plt.hist(data=bike_df, x='duration_min',bins=100)
plt.xlabel('duration (min)')
plt.xlim((0,100));
plt.title('Trip duration in Minutes', fontsize=15)
plt.show()

Upon using the x-limits, it is now clear that most of the trips do not last more than an hour.
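
Because `bins=100` spreads 100 bins over the full range up to about 1424 minutes, each bin is roughly 14 minutes wide even after zooming in. One option, sketched here on made-up durations, is to define explicit 1-minute bin edges over the range of interest; the same `edges` array could be passed as `bins` to `plt.hist`.

```python
import numpy as np

# Made-up trip durations in minutes (illustrative only).
duration_min = np.array([1.5, 5.4, 8.6, 8.9, 13.3, 45.0, 1424.1])

# 1-minute-wide bins from 0 to 100 minutes; longer trips fall outside the edges.
edges = np.arange(0, 100 + 1, 1)
counts, _ = np.histogram(duration_min, bins=edges)

print(len(counts))     # number of bins
print(counts[8])       # trips lasting between 8 and 9 minutes
print(counts.sum())    # trips that fall inside the 0 to 100 minute range
```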

During what period of the day are more trips likely to be booked?¶

In [33]:
# plotting the barplot to check the most frequent period of the day when most trips are booked.
freq=bike_df['period_ofday'].value_counts()
order_period=freq.index
plt.figure(figsize=[10,10])
plot_color=sb.color_palette()[0]
sb.countplot(data=bike_df, x='period_ofday',color=plot_color,order=order_period)
plt.show()

Most trips are booked in the afternoon and morning, with fewer at night. This is expected, as most users would ride during the day rather than at night.

In [34]:
# plotting the barplot for the period of day with percentages next to each bar.
total_period=bike_df['period_ofday'].value_counts().sum()
plt.figure(figsize=[10,10],dpi=100)
sb.countplot(data=bike_df,y='period_ofday',color='lightgray',order=order_period);
for j in range(freq.shape[0]):
    count=freq.iloc[j]  # positional access; freq is indexed by category labels
    pct_string='{:0.1f}%'.format(100*count/total_period)
    plt.text(count+1,j, pct_string, va="center")
# axis labels and title only need to be set once, outside the loop
plt.xlabel('Frequency')
plt.ylabel('Period of the day')
plt.title('Period of the day with Most Trips');

From the above plot, we observe that $38.5 \%$ of trips are made in the afternoon, $38.4 \%$ in the morning and $23.1 \%$ at night.
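
The shares can also be computed directly rather than read off the chart. A sketch using the period counts printed earlier (these were taken before the 72 outlier rows were dropped, so treat the result as approximate for the filtered data); on the full frame, `value_counts(normalize=True)` gives the same kind of proportions.

```python
import pandas as pd

# Counts per period from the value_counts() output shown earlier.
counts = pd.Series({"afternoon": 70634, "morning": 70465, "night": 42313})

# Share of all trips per period, in percent, rounded to one decimal place.
pct = (100 * counts / counts.sum()).round(1)
print(pct)
```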

What are the top 5 popular start stations for the trips taken?¶

In [35]:
#Plotting the pie chart for the top 5 popular start stations.
bike_df['start_station_name'].value_counts()[:5].plot(kind='pie',figsize=(10,10),autopct='%0.1F%%');
plt.title('Top Five Start Stations',fontsize=20);
pd.DataFrame(bike_df['start_station_name'].value_counts()[:5])
Out[35]:
start_station_name
Market St at 10th St 3904
San Francisco Caltrain Station 2 (Townsend St at 4th St) 3542
Berry St at 4th St 3051
Montgomery St BART Station (Market St at 2nd St) 2895
Powell St BART Station (Market St at 4th St) 2760

The top five start station names are:

  • Market St at 10th St

  • San Francisco Caltrain Station 2 (Townsend St at 4th St)

  • Berry St at 4th St

  • Montgomery St BART Station (Market St at 2nd St)

  • Powell St BART Station (Market St at 4th St)

We note that several of the busiest start stations lie on Market Street, in the centre of the business district, or next to Caltrain and BART stations, which suggests that these are good locations for a start station.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?¶

For the univariate visualization, the first variable of interest investigated was the member age. The distribution was found to be right-skewed and contained outliers, as seen from the boxplot. These outliers were dropped and the x-axis was transformed to a log scale. After these transformations, the age distribution appeared multimodal, with $3$ peaks.

The member gender and user type distributions showed that the majority of bike users were subscribers while fewer were customers. It was also observed that Male users were in the majority compared to Female users.

Furthermore, for the trip duration, it was noted that most of the trips did not last more than 60 minutes. For the period_ofday feature, the period with the most trips was the afternoon with $38.5\%$ of trips, followed by the morning with $38.4\%$ and the night with $23.1\%$.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?¶

From the features investigated, there were some unusual points and some tidying steps, such as:

  • The presence of outliers for the age variable, which were dropped.

  • Inappropriate datatypes for several features, which were changed to suitable datatypes.

  • Lastly, we created some new columns that were useful for our analysis: age, period_ofday and duration_min, which is the trip duration in minutes.

Bivariate Exploration¶

In this section, we investigate relationships between pairs of variables.

For each gender, how long, in minutes, does the trip last?¶

In [36]:
# plotting the barplot for the trip duration for each gender.
plt.figure(figsize=(10,8))
sb.barplot(data=bike_df,x='member_gender',y='duration_min')
plt.xlabel("Member Gender")
plt.ylabel('Trip duration (min)')
plt.title('Duration of trip for each gender in minutes');

From the barplot above, we note that trips with no recorded gender ("No Gender") have the longest average duration, followed by the Other category. We further note that Female users have a slightly longer average trip duration than Male users.

Is there any correlation between Member age and duration of trip?¶

In [37]:
# Checking the correlation between age and duration_min by plotting the heatmap.
numeric_vars=['age','duration_min']
plt.figure(figsize=(10,8))
sb.heatmap(bike_df[numeric_vars].corr(),annot=True,fmt='.2f',cmap='vlag_r',center=0)
plt.show()

From the heatmap, we observe that there is no correlation between the variables age and trip duration in minutes.

In [38]:
# Scatter plot to Check the correlation between age and duration_min.
plt.figure(figsize=(10,6))
sb.regplot(data=bike_df,  x ='age', y ='duration_min');
plt.xlabel('Age (Years)')
plt.ylabel('Trip duration (min)');
In [39]:
# Checking the average trip duration time.
bike_df['duration_min'].describe()
Out[39]:
count    183340.000000
mean         12.101764
std          29.911955
min           1.016667
25%           5.416667
50%           8.566667
75%          13.266667
max        1424.066667
Name: duration_min, dtype: float64

The scatter plot also confirms that there is no apparent relationship between age and trip duration in minutes. I had expected that the older the user, the more time it would take to complete a given trip, as physical strength and speed reduce as one grows older.
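
Since trip duration is heavily right-skewed, a rank-based (Spearman) correlation can be a useful cross-check on the Pearson value shown in the heatmap. A sketch on made-up data, computing Spearman as the Pearson correlation of the ranks; the toy values are assumptions for illustration only.

```python
import pandas as pd

# Made-up ages and durations, including one very long trip (illustrative only).
toy = pd.DataFrame({
    "age": [22, 25, 31, 38, 45, 60],
    "duration_min": [9.0, 7.5, 300.0, 8.0, 12.0, 6.5],
})

# Pearson works on the raw values and is sensitive to the 300-minute trip.
pearson = toy["age"].corr(toy["duration_min"])

# Spearman is the Pearson correlation of the ranks, so extreme values carry less weight.
spearman = toy["age"].rank().corr(toy["duration_min"].rank())

print(round(pearson, 3), round(spearman, 3))
```

If both measures are close to zero on the real data, that strengthens the conclusion that age and trip duration are not related.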

What is the relationship between age and user_type?¶

In [40]:
#plotting the relationship between age and user_type with the aid of a violin plot.
plt.figure(figsize = [16, 5],dpi=100)
base_color=sb.color_palette()[0]
plt.subplot(1,2,1)
ax1 = sb.violinplot(data=bike_df, y ='age', x='user_type', color=base_color)
plt.xticks(rotation = 15);

#plotting the relationship between age and user_type with the aid of a boxplot.
plt.subplot(1,2,2)
sb.boxplot(data=bike_df,  y = 'age', x='user_type', color=base_color)
plt.xticks(rotation = 15);
plt.ylim(ax1.get_ylim());

From the boxplot and violin plots above, we see that the median age for the Customer user type is slightly lower than that of the Subscriber type. We also note that subscribers have a higher maximum age, with more extreme outliers, compared to customers.
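
The medians and maxima read off the plots can be confirmed numerically with a grouped summary. A sketch on made-up rows (illustrative values, not from the dataset); on the real data the same call would be applied to `bike_df`.

```python
import pandas as pd

# Made-up ages for the two user types (illustrative only).
toy = pd.DataFrame({
    "user_type": ["Customer", "Customer", "Customer",
                  "Subscriber", "Subscriber", "Subscriber"],
    "age": [24, 30, 41, 27, 33, 65],
})

# One row per user type, with the median and maximum age.
summary = toy.groupby("user_type")["age"].agg(["median", "max"])
print(summary)
```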

What is the relationship between the 3 categorical variables period_ofday, user_type and member_gender?¶

In [41]:
# plotting the relationship between the categorical variables period_ofday, user_type and member_gender.
plt.figure(figsize=(10,10),dpi=100)

#Subplot 1: period_ofday vs user_type
plt.subplot(3,1,1)
sb.countplot(data=bike_df, x='period_ofday',hue='user_type',palette='Blues')


#Subplot 2: period_ofday vs member_gender
ax=plt.subplot(3,1,2)
sb.countplot(data=bike_df, x='period_ofday',hue='member_gender',palette='Blues')
ax.legend(ncol=2)

#Subplot 3: user_type vs member_gender
ax=plt.subplot(3,1,3)
sb.countplot(data=bike_df, x='user_type',hue='member_gender',palette='Reds')
ax.legend(loc=2, ncol=2)
plt.show()

There are more subscribers than customers in all $3$ periods of the day.

We also note that there are fewer Female users than Male users in each of the $3$ periods of the day.

From the relationship between user_type and gender, we note that there are far fewer Female customers and subscribers than Male ones in both user types. We further note that there are fewer Male customers than Male subscribers.
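
Raw counts make it hard to compare groups of very different sizes, such as subscribers and customers. A sketch with `pd.crosstab` on made-up rows, showing how within-group proportions could be tabulated instead; the toy values are assumptions for illustration.

```python
import pandas as pd

# Made-up rows (illustrative only).
toy = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Subscriber", "Customer", "Customer"],
    "member_gender": ["Male", "Male", "Female", "Male", "Female"],
})

# Row-normalised table: share of each gender within each user type.
share = pd.crosstab(toy["user_type"], toy["member_gender"], normalize="index")
print(share.round(2))
```

Each row sums to 1, so the gender mix of customers and subscribers can be compared directly regardless of how many trips each group made.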

In [42]:
def chart_plot(x):
    """Count the trips in each category of feature x and plot a pie chart of each category's share.

    The user_type column is used only as the column to count, since it has no missing values.
    """
    fea_stat=bike_df.groupby([x],as_index=False)["user_type"].count()
    fig=px.pie(fea_stat,names=x,values='user_type',color_discrete_sequence=px.colors.sequential.RdBu,width=700,height=600)
    return fig
In [43]:
chart_plot('member_gender')

A large percentage of bike users are Males.

In [44]:
chart_plot('period_ofday')

The afternoon and morning period of day are the peak periods of the day with more users for both customer and Subscriber user types.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?¶

Some observations from the bivariate exploration are:

  • Male users have a slightly shorter trip duration than Female users.

  • Subscribers tend to have a shorter trip duration than customers.

  • For both user types, Male users are in the majority.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?¶

It was interesting to note that Male users were in the majority across all $3$ periods of the day, namely morning, afternoon and night. We also noted that, in all $3$ periods of the day, there were more subscribers than customers.

Surprisingly, no correlation was observed between age and trip duration. One might have expected older users to have longer trip durations.
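The absence of a linear relationship between two numeric columns can be quantified with a Pearson coefficient. The sketch below is illustrative only: `toy_df` and its values are made up, and only the column names `age` and `duration_min` are taken from the notebook.

```python
import pandas as pd

# Hypothetical toy data; the real analysis would use bike_df instead.
toy_df = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "duration_min": [12.0, 8.0, 15.0, 9.0, 12.0],
})

# Series.corr defaults to the Pearson correlation coefficient.
# Values near 0 indicate little or no linear relationship.
r = toy_df["age"].corr(toy_df["duration_min"])
print(f"Pearson r between age and duration_min: {r:.3f}")
```

A coefficient close to zero only rules out a linear trend; a scatter plot is still worth checking for non-linear patterns.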

Multivariate Exploration¶

In this section, we will create plots of three or more variables to investigate further relationships between the different variables of interest.

For each period of the day, what is the average trip duration in minutes for each user type?¶

In [45]:
# Plot the average trip duration for each period of the day and user type using a clustered bar plot.
plt.figure(figsize=[10,8],dpi=100)
# errorbar=None (seaborn >= 0.12) replaces the deprecated ci=None and hides the error bars.
plot=sb.barplot(data=bike_df, x = 'period_ofday', y='duration_min', hue='user_type',errorbar=None)
plot.set(xlabel="Period of day", ylabel='Trip duration (min)')
plt.title('Average trip duration by period of day and user type');

For the morning, afternoon and night periods of the day, we note that the average trip duration is longer for customers than for subscribers. The afternoon period has the highest average trip duration for the customer user type.
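The bar heights in a clustered bar plot of this kind are group means, so they can be reproduced as a table. This is a minimal sketch on a made-up frame `toy_df` (hypothetical values; only the column names come from the notebook), using `groupby` followed by `unstack`.

```python
import pandas as pd

# Hypothetical toy data standing in for bike_df.
toy_df = pd.DataFrame({
    "period_ofday": ["morning", "morning", "afternoon", "afternoon",
                     "night", "night", "morning", "afternoon"],
    "user_type": ["Customer", "Subscriber", "Customer", "Subscriber",
                  "Customer", "Subscriber", "Subscriber", "Customer"],
    "duration_min": [20.0, 10.0, 30.0, 12.0, 18.0, 9.0, 14.0, 26.0],
})

# Mean duration per (period, user type), reshaped so that each user type
# becomes a column and each period of the day a row.
mean_duration = (
    toy_df.groupby(["period_ofday", "user_type"])["duration_min"]
    .mean()
    .unstack("user_type")
)
print(mean_duration)
```

Having the means as numbers makes it easier to state how much longer customer trips are in each period, rather than estimating the gap from the plot.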

What is the relationship between member_gender, age and duration_min?¶

In [46]:
# Plotting the relationship between member_gender, age and duration_min.
plt.figure(figsize=[10,10],dpi=200);
sb.scatterplot(data=bike_df, x='age', y='duration_min', hue='member_gender', linewidth=0);

We note the following from the above scatter plot:

  • The trip duration for the majority of Male users is between $0$ and $200$ minutes, as this is where most of the points are clustered.

  • There are relatively fewer Female users, judging by the blue points that indicate the Female gender, and we can also see that the trip duration is longer for most Female users than for Male users.

  • For ages from $80$ to $100$, we observe that the trip duration is nearly zero, which suggests that these points might be outliers.
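One way to inspect the suspected outliers noted above is to separate them with a boolean mask. The sketch below is only an illustration: `toy_df`, its values, and the cutoff name `AGE_CUTOFF` are assumptions, with the threshold of 80 taken from the age range mentioned in the observation.

```python
import pandas as pd

# Hypothetical toy data standing in for bike_df.
toy_df = pd.DataFrame({
    "age": [25, 34, 81, 47, 99, 62],
    "duration_min": [12.0, 150.0, 3.0, 9.0, 1.5, 20.0],
})

# Assumed cutoff, based on the 80-100 age range discussed in the text.
AGE_CUTOFF = 80

# Rows at or above the cutoff are set aside for inspection;
# the remaining rows form the trimmed dataset.
suspect = toy_df[toy_df["age"] >= AGE_CUTOFF]
trimmed = toy_df[toy_df["age"] < AGE_CUTOFF]

print(suspect)
print(f"{len(suspect)} of {len(toy_df)} rows flagged")
```

Whether such rows should be dropped or kept is a judgment call; inspecting them first shows how many trips are affected.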

How closely correlated are the different variables in the dataset?¶

In [47]:
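# Plot the correlation matrix for the numeric variables in the dataset.
plt.figure(figsize=(10,8))
# numeric_only=True restricts the matrix to numeric columns; pandas >= 2.0 raises an error on text columns otherwise.
sb.heatmap(bike_df.corr(numeric_only=True),annot=True,fmt='.2f',cmap='vlag_r',center=0)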
plt.show()

The above heatmap shows the correlation between each pair of numeric features in the dataset. There exists a strong correlation between start station longitude and end station longitude. We also see that none of the other features is correlated with the time variables duration_sec and duration_min. Furthermore, we observe that there is a weak correlation between age and member_birth_year.
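The numbers annotated on a correlation heatmap come from a plain correlation matrix, which can be inspected on its own. The following sketch assumes a made-up frame `toy_df` (hypothetical values; the column names mirror those in the dataset) and selects the numeric columns explicitly before calling `corr`.

```python
import pandas as pd

# Hypothetical toy data standing in for bike_df.
toy_df = pd.DataFrame({
    "start_station_longitude": [-122.40, -122.41, -122.27, -121.89, -122.42],
    "end_station_longitude": [-122.41, -122.40, -122.26, -121.90, -122.43],
    "duration_min": [12.0, 8.0, 15.0, 9.0, 12.0],
    "user_type": ["Subscriber", "Customer", "Subscriber",
                  "Subscriber", "Customer"],
})

# Keep only numeric columns so that text columns such as user_type
# do not enter the correlation computation.
numeric = toy_df.select_dtypes(include="number")

# Pairwise Pearson correlations between the numeric columns.
corr = numeric.corr()
print(corr.round(2))
```

Selecting the numeric columns first has the same effect as passing `numeric_only=True` to `corr`, and it makes explicit which columns take part.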

What is the relationship between member_gender, age and user_type?¶

In [48]:
# Plotting the relationship between member_gender, age and user_type using a pointplot.
fig = plt.figure(figsize = [10,8],dpi=100)
ax = sb.pointplot(data = bike_df, x = 'member_gender', y = 'age', hue = 'user_type',
           palette = 'Blues', linestyles = '', dodge = True,errorbar="sd")
plt.title('Members\' age across gender and user types')
plt.xlabel('Member Gender')
plt.ylabel('Age (years)')
plt.show();

The pointplot above shows how the mean member age differs between the user types within each member gender.

We observe that, for the Female gender, the mean age of customers is lower than that of subscribers. Male customers have a higher mean age than Female customers, and Male subscribers likewise have a higher mean age than Female subscribers. We further note that, for the Other gender, the mean ages of both customers and subscribers are higher than for the rest of the gender types. Generally, customers have a lower mean age than subscribers across all genders.
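The means compared above can be tabulated alongside their spread. This is a minimal sketch under stated assumptions: `toy_df` and its ages are made up, and only the column names `member_gender`, `user_type` and `age` come from the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for bike_df.
toy_df = pd.DataFrame({
    "member_gender": ["Female", "Female", "Female", "Female",
                      "Male", "Male", "Male", "Male"],
    "user_type": ["Customer", "Customer", "Subscriber", "Subscriber",
                  "Customer", "Customer", "Subscriber", "Subscriber"],
    "age": [28, 32, 34, 38, 30, 36, 35, 41],
})

# Mean and sample standard deviation of age for every
# (gender, user type) combination.
summary = (
    toy_df.groupby(["member_gender", "user_type"])["age"]
    .agg(["mean", "std"])
)
print(summary.round(2))
```

Reading the means next to the standard deviations helps judge whether a difference between groups is large relative to the variation within them.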

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?¶

From the Multivariate visualization, the following were some of the relationships observed:

  • Customers tend to use the bike service more in the afternoon and have a longer trip duration than subscribers.

  • Subscribers have a higher mean age than customers across all gender types.

  • There is a strong correlation between start station longitude and end station longitude. Thus, the location of the users plays a role in determining the best marketing strategy.

Were there any interesting or surprising interactions between features?¶

It was surprising to note that members aged between $80$ and $100$ had a lower trip duration, and also that the Other gender had a higher mean age for both the customer and subscriber user types.

Interestingly, and as expected, there are relatively fewer Female bike users than Male bike users. Also, Female users have a slightly longer trip duration than Male users.

Conclusions¶

  • The top $5$ most popular pick-up locations are:
      * Market St at 10th St
      * San Francisco Caltrain Station 2 (Townsend St at 4th St)
      * Berry St at 4th St
      * Montgomery St BART Station (Market St at 2nd St)
      * Powell St BART Station (Market St at 4th St)
  • On average, the trip duration is about 12 minutes and most trips do not last longer than 60 minutes.
  • The most prominent user type is the subscriber.
  • Across all user types, a large percentage of bike users are Male.
  • The afternoon and morning periods of the day are the peak periods for both user types.